Survival of the Fittest: Cox Regression

Capstone 6940

Hazardous Conditions: Kayla Boyd & Hermela Shimelis

Intro to Cox Regression

  • Cox regression is a popular regression modeling method used to predict survival rates given certain covariates.

    • “Survival” can refer to the development of a symptom, time to relapse after remission, or as a time to death [1].

    • Cox regression model is based on the hazard function h_x(t) with covariates at time t given by:

      h_x(t)=h_0(t)\exp(\beta_1x_1 +\beta_2x_2 + \dots + \beta_p x_n) h_x(t) = hazard \ function\ h_0(t) = Baseline \ hazard \ function\

      \beta_1x_1 +\beta_2x_2 + \dots + \beta_p x_n = linear\ combination\ of \ covariates\ and\ their\ coefficients


Methods: Making Sense of Cox Regression

  • The assumption of a constant relationship between dependent and explanatory variables is called proportional hazards [2].

  • The hazard function is the probability that an individual will experience an event (death) within a certain time interval [1].

    • If the risk factor is binary, the result can be interpreted as the estimated increase in HR in patients with the risk factor vs. those without [3].

    • If the risk factor is continuous, the HR is interpreted as an increase/decrease in the hazard rate of death due to a 1 unit increase in the variable [3].

Graphical strategies to assess proportionality assumption:

  • Kaplan-Meier Survival Distribution S(t): Plot (S(t)) as a function of survival time for each level covariate.

  • Plot the function (\log(-\log(S(t)))) as a function of the log survival time (log represent natural log).

  • Schoenfeld Residuals

  • Failing to meet the assumption of proportional hazards means that the effects between dependent and explanatory variables are not constant over time.

  • Time-varying covariates (coefficients) are used when a covariate changes over time during the follow-up period [4].

    • Example: The effect of the size of a patient’s tumor on their chances of survival.
  • Internal time-varying coefficients are affected by survival status and include values that are generated by the subject [4].

    • A patient’s blood pressure levels during a study on cardiovascular events.
  • External time-varying coefficients are pre-determined and external to the subject under study [4].

    • Pollen levels during a study on patients with asthma.

Analysis: Corporate Survival

By Kayla Boyd

Data Source: Employee Turnover Data from Kaggle

Variable Description
stag Experience (time)
event Employee turnover
gender Employee’s gender
age Employee’s age (year)
industry Employee’s industry
profession Employee’s profession
traffic From what pipelene employee came to the company.
coach Presence of a coach (training) on probation
head_gender Head (supervisor) gender
greywage The salary does not seem to the tax authorities.
way Employee’s way of transportation
extraversion Extroversion score
independ Independence score
selfcontrol Self control score
anxiety Anxiety score
novator Innovator score
{r, warning=FALSE, echo=F, message=FALSE}

library(readr)
turnover <- read_csv("turnover.csv")
View(turnover)

library(tidyverse) 
library(dplyr)
library(corrplot)
library(RColorBrewer)
library(ggfortify)
library(riskRegression)
library(survival)
library(stringr)
library(zoo)
library(ranger)
library(ggplot2)
library(readxl)
library(MASS)
library(ADGofTest)
library(survminer)
library(car)

# Check for null and duplicate values

cat("Number of missing values :", sum(is.na(turnover)))
cat("Number of duplicates: ", turnover %>%
    duplicated() %>%
    sum())
# Remove duplicate values

turnover <- unique(turnover)

# Check distribution of data

turnover %>%
  ggplot(aes(x = stag, color = factor(event),
                              fill = factor(event))) +
  geom_histogram(aes(y = ..density..), alpha = 0.5) +
  geom_density(alpha = 0.05) + scale_color_brewer(palette = "Dark2") +
  scale_fill_brewer(palette = "Dark2") + theme_minimal() + 
  theme(legend.position = "top")

# Check for censored observations

n <- dim(turnover)[1]
cat((n - sum(turnover$event))/n * 100, "% of observations are censored")

# Separate variable based on type

NUM_COLS <- c("stag", "age", "extraversion", "independ", "selfcontrol", "anxiety", "novator")
CAT_COLS <- c("gender", "industry", "profession", "traffic", "coach", "head_gender", "greywage", "way")

# Transform CAT_COLS into categorical type

    for (COL in CAT_COLS){
    turnover[COL] <- turnover[COL] %>% unlist() %>% factor()}
Number of missing values : 0
Number of duplicates:  13

49.82079 % of observations are censored

Results

# Categorical covariates

# Selection of covariates (that are only discrete)
turnover.cat <- turnover %>% 
                select_if(is.factor) %>% 
                mutate(event = turnover$event)

turnover.cat %>% ggplot(aes(x = gender, color = factor(event),
                            fill = factor(event))) +
  geom_bar(alpha = 0.5) +
  scale_color_brewer(palette = "Dark2") + scale_fill_brewer(palette = "Dark2") +
  theme_minimal() + theme(legend.position = "top")

turnover.cat %>% ggplot(aes(x = industry, color = factor(event),
                            fill = factor(event))) +
  geom_bar(alpha = 0.5) + 
  scale_color_brewer(palette = "Dark2") + scale_fill_brewer(palette = "Dark2") +
  theme_minimal() + theme(legend.position = "top")

turnover.cat %>% ggplot(aes(x = profession, color = factor(event),
                            fill = factor(event))) +
  geom_bar(alpha = 0.5) + 
  scale_color_brewer(palette = "Dark2") + scale_fill_brewer(palette = "Dark2") +
  theme_minimal() + theme(legend.position = "top")

turnover.cat %>% ggplot(aes(x = traffic, color = factor(event),
                            fill = factor(event))) +
  geom_bar(alpha = 0.5) + 
  scale_color_brewer(palette = "Dark2") + scale_fill_brewer(palette = "Dark2") +
  theme_minimal() + theme(legend.position = "top")

turnover.cat %>% ggplot(aes(x = coach, color = factor(event),
                            fill = factor(event))) +
  geom_bar(alpha = 0.5) + 
  scale_color_brewer(palette = "Dark2") + scale_fill_brewer(palette = "Dark2") +
  theme_minimal() + theme(legend.position = "top")

turnover.cat %>% ggplot(aes(x = head_gender, color = factor(event),
                            fill = factor(event))) +
  geom_bar(alpha = 0.5) + 
  scale_color_brewer(palette = "Dark2") + scale_fill_brewer(palette = "Dark2") +
  theme_minimal() + theme(legend.position = "top")

turnover.cat %>% ggplot(aes(x = greywage, color = factor(event),
                            fill = factor(event))) +
  geom_bar(alpha = 0.5) + 
  scale_color_brewer(palette = "Dark2") + scale_fill_brewer(palette = "Dark2") +
  theme_minimal() + theme(legend.position = "top")

turnover.cat %>% ggplot(aes(x = way, color = factor(event), 
                                     fill = factor(event))) +
  geom_bar(alpha = 0.5) + 
  scale_color_brewer(palette = "Dark2") + scale_fill_brewer(palette = "Dark2") +
  theme_minimal() + theme(legend.position = "top")

# Continuous covariates
turnover %>% ggplot(aes(x = stag, color = factor(event), 
                        fill = factor(event))) +
  geom_histogram(aes(y = ..density..), alpha = 0.5) + 
  geom_density(alpha = 0.05) +  
  scale_color_brewer(palette = "Dark2") + scale_fill_brewer(palette = "Dark2") +
  theme_minimal() + theme(legend.position = "top")

turnover %>% ggplot(aes(x = event, color = factor(event),
                        fill = factor(event))) +
  geom_histogram(aes(y = ..density..), alpha = 0.5) + 
  geom_density(alpha = 0.05) +
  scale_color_brewer(palette = "Dark2") + scale_fill_brewer(palette = "Dark2") + 
  theme_minimal() + theme(legend.position = "top")

turnover %>% ggplot(aes(x = age, color = factor(event),
                        fill = factor(event))) +
  geom_histogram(aes(y = ..density..), alpha = 0.5) +
  geom_density(alpha = 0.05) +
  scale_color_brewer(palette = "Dark2") + scale_fill_brewer(palette = "Dark2") + 
  theme_minimal() + theme(legend.position = "top")

turnover %>% ggplot(aes(x = extraversion, color = factor(event), 
                        fill = factor(event))) + 
  geom_histogram(aes(y = ..density..), alpha = 0.5) + 
  geom_density(alpha = 0.05) + scale_color_brewer(palette = "Dark2") + 
  scale_fill_brewer(palette = "Dark2") + theme_minimal() + 
  theme(legend.position = "top")

turnover %>% ggplot(aes(x = independ, color = factor(event), 
                        fill = factor(event))) +
  geom_histogram(aes(y = ..density..), alpha = 0.5) + 
  geom_density(alpha = 0.05) + 
  scale_color_brewer(palette = "Dark2") + scale_fill_brewer(palette = "Dark2") + 
  theme_minimal() + theme(legend.position = "top")

turnover %>% ggplot(aes(x = selfcontrol, color = factor(event),
                        fill = factor(event))) +
  geom_histogram(aes(y = ..density..), alpha = 0.5) +
  geom_density(alpha = 0.05) + 
  scale_color_brewer(palette = "Dark2") + scale_fill_brewer(palette = "Dark2") +
  theme_minimal() + theme(legend.position = "top")

turnover %>% ggplot(aes(x = anxiety, color = factor(event),
                        fill = factor(event))) +
  geom_histogram(aes(y = ..density..), alpha = 0.5) +
  geom_density(alpha = 0.05) + 
  scale_color_brewer(palette = "Dark2") + scale_fill_brewer(palette = "Dark2") +
  theme_minimal() + theme(legend.position = "top")

turnover %>% ggplot(aes(x = novator, color = factor(event),
                        fill = factor(event))) +
  geom_histogram(aes(y = ..density..), alpha = 0.5) +
  geom_density(alpha = 0.05) + 
  scale_color_brewer(palette = "Dark2") + scale_fill_brewer(palette = "Dark2") +
  theme_minimal() + theme(legend.position = "top")

# Other encoding of discrete variables
turnover.num <- turnover.cat[-9]

levels(turnover.num$gender) <- 1:length(levels(turnover.cat$gender))
levels(turnover.num$industry) <- 1:length(levels(turnover.cat$industry))
levels(turnover.num$profession) <- 1:length(levels(turnover.cat$profession))
levels(turnover.num$traffic) <- 1:length(levels(turnover.cat$traffic))
levels(turnover.num$coach) <- 1:length(levels(turnover.cat$coach))
levels(turnover.num$head_gender) <- 1:length(levels(turnover.cat$head_gender))
levels(turnover.num$greywage) <- 1:length(levels(turnover.cat$greywage))
levels(turnover.num$way) <- 1:length(levels(turnover.cat$way))
turnover.num <- as.data.frame(apply(turnover.num, 2, as.numeric))

turnover.num <- cbind(turnover.num, turnover %>%
    select_if(is.numeric))

# Correlation matrix
corrplot(cor(turnover.num), col = brewer.pal(10, "BrBG"), method = "square", diag = FALSE)

Survival Plots

Now we will plot our categorical variables (gender, industry, profession, traffic, coach, head_gender, greywage, and way) using the Kaplan-Meier Estimation Curve. The Kaplan-Meier allows us to visualize how each category of survival probability differs from the others.

km_gender<-survfit(Surv(stag, event)~gender, data = turnover, type="kaplan-meier")

ggsurvplot(km_gender, data=turnover,
           conf.int = FALSE,
           ggtheme = theme_minimal(),
           legend.labs = c("female", "male"),
           pval = TRUE,
           pval.method = TRUE)+
  ggtitle("Survival curve based on Gender")

km_industry<-survfit(Surv(stag, event)~industry, data = turnover, type ="kaplan-meier")
ggsurvplot(km_industry, data=turnover,
           conf.int = FALSE,
           ggtheme = theme_minimal(),
           pval = TRUE,
           pval.method = TRUE)+
  ggtitle("Survival curve based on Industry")

km_profession<-survfit(Surv(stag, event)~profession, data = turnover, type="kaplan-meier")
ggsurvplot(km_profession, data=turnover,
           conf.int = FALSE,
           ggtheme = theme_minimal(),
           pval = TRUE,
           pval.method = TRUE)+
  ggtitle("Survival curve based on Profession")

km_traffic<-survfit(Surv(stag, event)~traffic, data= turnover, type="kaplan-meier")
ggsurvplot(km_traffic, data=turnover,
           conf.int = FALSE,
           ggtheme = theme_minimal(),
           pval = TRUE,
           pval.method = TRUE)+
  ggtitle("Survival curve based on Traffic")

km_coach<-survfit(Surv(stag, event)~coach, data = turnover, type="kaplan-meier")
ggsurvplot(km_coach, data=turnover,
           conf.int = FALSE,
           ggtheme = theme_minimal(),
           pval = TRUE,
           pval.method = TRUE)+
  ggtitle("Survival curve based on Coach")

km_headgender<-survfit(Surv(stag, event)~head_gender, data= turnover, type="kaplan-meier")
ggsurvplot(km_headgender, data=turnover,
           conf.int = FALSE,
           ggtheme = theme_minimal(),
           legend.labs = c("female", "male"),
           pval = TRUE,
           pval.method = TRUE)+
  ggtitle("Survival curve based on Head Gender")

km_greywage<-survfit(Surv(stag, event)~greywage, data = turnover, type ="kaplan-meier")
ggsurvplot(km_greywage, data=turnover,
           conf.int = FALSE,
           ggtheme = theme_minimal(),
           legend.labs = c("grey", "white"),
           pval = TRUE,
           pval.method = TRUE)+
  ggtitle("Survival curve based on Greywage")

km_way<-survfit(Surv(stag, event)~way,
                data= turnover,
                type="kaplan-meier")
ggsurvplot(km_way, data=turnover,
           conf.int = FALSE,
           ggtheme = theme_minimal(),
           legend.labs=c("bus", "car", "foot"),
           pval = TRUE,
           pval.method = TRUE)+
  ggtitle("Survival curve based on Commuters(way)")

Modeling the Outcomes

model0<-coxph(Surv(stag, event)~.,
                 data = turnover)
summary(model0)
Call:
coxph(formula = Surv(stag, event) ~ ., data = turnover)

  n= 1116, number of events= 560 

                                   coef exp(coef)  se(coef)      z Pr(>|z|)    
genderm                       -0.110567  0.895327  0.127236 -0.869 0.384855    
age                            0.021321  1.021550  0.006992  3.050 0.002292 ** 
industryBanks                 -0.228597  0.795649  0.370386 -0.617 0.537112    
industryBuilding              -0.228859  0.795441  0.394945 -0.579 0.562272    
industryConsult               -0.387826  0.678530  0.378031 -1.026 0.304934    
industryetc                   -0.571840  0.564486  0.376067 -1.521 0.128365    
industryHoReCa                -0.641488  0.526508  0.545700 -1.176 0.239782    
industryIT                    -1.188633  0.304637  0.392874 -3.025 0.002482 ** 
industrymanufacture           -0.799654  0.449484  0.373800 -2.139 0.032415 *  
industryMining                -0.603336  0.546984  0.449444 -1.342 0.179465    
industryPharma                -1.005353  0.365916  0.480887 -2.091 0.036562 *  
industryPowerGeneration       -1.032475  0.356124  0.451676 -2.286 0.022261 *  
industryRealEstate            -1.725592  0.178068  0.588426 -2.933 0.003362 ** 
industryRetail                -1.042228  0.352668  0.363732 -2.865 0.004165 ** 
industryState                 -0.667077  0.513206  0.410168 -1.626 0.103875    
industryTelecom               -1.186918  0.305160  0.448385 -2.647 0.008119 ** 
industrytransport             -0.852135  0.426504  0.427996 -1.991 0.046482 *  
professionBusinessDevelopment  0.600890  1.823741  0.508792  1.181 0.237597    
professionCommercial           0.998866  2.715201  0.507452  1.968 0.049022 *  
professionConsult              0.570631  1.769384  0.520478  1.096 0.272921    
professionEngineer             0.998848  2.715151  0.538726  1.854 0.063726 .  
professionetc                  0.486204  1.626132  0.486495  0.999 0.317600    
professionFinance              0.054707  1.056231  0.526353  0.104 0.917220    
professionHR                   0.202198  1.224091  0.429582  0.471 0.637865    
professionIT                   0.069725  1.072213  0.491667  0.142 0.887228    
professionLaw                  0.403392  1.496894  0.647872  0.623 0.533520    
professionmanage               1.284135  3.611541  0.500159  2.567 0.010245 *  
professionMarketing            0.726077  2.066956  0.482551  1.505 0.132411    
professionPR                   0.846313  2.331035  0.640073  1.322 0.186097    
professionSales                0.505293  1.657472  0.467728  1.080 0.280002    
professionTeaching             0.617169  1.853673  0.569441  1.084 0.278446    
trafficempjs                   0.928219  2.529999  0.314661  2.950 0.003179 ** 
trafficfriends                 0.122161  1.129937  0.342264  0.357 0.721151    
trafficKA                      0.141661  1.152186  0.353556  0.401 0.688659    
trafficrabrecNErab             0.548112  1.729984  0.310084  1.768 0.077124 .  
trafficrecNErab               -0.051263  0.950029  0.381447 -0.134 0.893095    
trafficreferal                 0.368137  1.445041  0.325262  1.132 0.257711    
trafficyoujs                   0.654608  1.924387  0.309447  2.115 0.034395 *  
coachno                        0.056040  1.057640  0.111493  0.503 0.615221    
coachyes                       0.211061  1.234988  0.151771  1.391 0.164329    
head_genderm                   0.055167  1.056717  0.102616  0.538 0.590846    
greywagewhite                 -0.505741  0.603059  0.135383 -3.736 0.000187 ***
waycar                        -0.201625  0.817401  0.103831 -1.942 0.052153 .  
wayfoot                       -0.402833  0.668424  0.174165 -2.313 0.020726 *  
extraversion                   0.016623  1.016761  0.035460  0.469 0.639236    
independ                      -0.019380  0.980806  0.035771 -0.542 0.587962    
selfcontrol                   -0.045743  0.955287  0.035878 -1.275 0.202321    
anxiety                       -0.048800  0.952372  0.034633 -1.409 0.158827    
novator                        0.009108  1.009150  0.030631  0.297 0.766196    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

                              exp(coef) exp(-coef) lower .95 upper .95
genderm                          0.8953     1.1169    0.6977    1.1489
age                              1.0216     0.9789    1.0076    1.0356
industryBanks                    0.7956     1.2568    0.3850    1.6444
industryBuilding                 0.7954     1.2572    0.3668    1.7250
industryConsult                  0.6785     1.4738    0.3234    1.4235
industryetc                      0.5645     1.7715    0.2701    1.1797
industryHoReCa                   0.5265     1.8993    0.1807    1.5343
industryIT                       0.3046     3.2826    0.1410    0.6580
industrymanufacture              0.4495     2.2248    0.2160    0.9352
industryMining                   0.5470     1.8282    0.2267    1.3199
industryPharma                   0.3659     2.7329    0.1426    0.9391
industryPowerGeneration          0.3561     2.8080    0.1469    0.8631
industryRealEstate               0.1781     5.6158    0.0562    0.5642
industryRetail                   0.3527     2.8355    0.1729    0.7194
industryState                    0.5132     1.9485    0.2297    1.1466
industryTelecom                  0.3052     3.2770    0.1267    0.7348
industrytransport                0.4265     2.3446    0.1843    0.9868
professionBusinessDevelopment    1.8237     0.5483    0.6728    4.9436
professionCommercial             2.7152     0.3683    1.0043    7.3408
professionConsult                1.7694     0.5652    0.6380    4.9074
professionEngineer               2.7152     0.3683    0.9446    7.8047
professionetc                    1.6261     0.6150    0.6267    4.2195
professionFinance                1.0562     0.9468    0.3765    2.9634
professionHR                     1.2241     0.8169    0.5274    2.8410
professionIT                     1.0722     0.9327    0.4090    2.8105
professionLaw                    1.4969     0.6681    0.4205    5.3292
professionmanage                 3.6115     0.2769    1.3551    9.6256
professionMarketing              2.0670     0.4838    0.8028    5.3221
professionPR                     2.3310     0.4290    0.6648    8.1730
professionSales                  1.6575     0.6033    0.6627    4.1455
professionTeaching               1.8537     0.5395    0.6072    5.6590
trafficempjs                     2.5300     0.3953    1.3655    4.6877
trafficfriends                   1.1299     0.8850    0.5777    2.2100
trafficKA                        1.1522     0.8679    0.5762    2.3039
trafficrabrecNErab               1.7300     0.5780    0.9421    3.1768
trafficrecNErab                  0.9500     1.0526    0.4498    2.0064
trafficreferal                   1.4450     0.6920    0.7639    2.7337
trafficyoujs                     1.9244     0.5196    1.0493    3.5293
coachno                          1.0576     0.9455    0.8500    1.3160
coachyes                         1.2350     0.8097    0.9172    1.6628
head_genderm                     1.0567     0.9463    0.8642    1.2921
greywagewhite                    0.6031     1.6582    0.4625    0.7863
waycar                           0.8174     1.2234    0.6669    1.0019
wayfoot                          0.6684     1.4961    0.4751    0.9404
extraversion                     1.0168     0.9835    0.9485    1.0899
independ                         0.9808     1.0196    0.9144    1.0520
selfcontrol                      0.9553     1.0468    0.8904    1.0249
anxiety                          0.9524     1.0500    0.8899    1.0193
novator                          1.0092     0.9909    0.9503    1.0716

Concordance= 0.661  (se = 0.012 )
Likelihood ratio test= 173.8  on 49 df,   p=7e-16
Wald test            = 176.7  on 49 df,   p=3e-16
Score (logrank) test = 183.5  on 49 df,   p=<2e-16

We want to refine the model to use the independent variable(s) with the greatest influence from our crude model. To do this, we will use the “step” function in R which iteratively adds or removes variables based on a chosen criterion. For this model we will use the AIC value. The Akaike Information Criterion (AIC) is a statistical metric that evaluates how well a model fits a given set of data.

step(model0)
Start:  AIC=6717.67
Surv(stag, event) ~ gender + age + industry + profession + traffic + 
    coach + head_gender + greywage + way + extraversion + independ + 
    selfcontrol + anxiety + novator

               Df    AIC
- coach         2 6715.6
- novator       1 6715.8
- extraversion  1 6715.9
- head_gender   1 6716.0
- independ      1 6716.0
- gender        1 6716.4
- selfcontrol   1 6717.3
- anxiety       1 6717.7
<none>            6717.7
- profession   14 6720.5
- way           2 6721.8
- age           1 6724.8
- greywage      1 6728.3
- traffic       7 6741.6
- industry     15 6746.9

Step:  AIC=6715.58
Surv(stag, event) ~ gender + age + industry + profession + traffic + 
    head_gender + greywage + way + extraversion + independ + 
    selfcontrol + anxiety + novator

               Df    AIC
- novator       1 6713.8
- head_gender   1 6713.8
- extraversion  1 6713.8
- independ      1 6713.9
- gender        1 6714.3
- selfcontrol   1 6715.2
<none>            6715.6
- anxiety       1 6715.7
- profession   14 6718.8
- way           2 6719.2
- age           1 6722.8
- greywage      1 6726.4
- traffic       7 6739.4
- industry     15 6746.4

Step:  AIC=6713.76
Surv(stag, event) ~ gender + age + industry + profession + traffic + 
    head_gender + greywage + way + extraversion + independ + 
    selfcontrol + anxiety

               Df    AIC
- extraversion  1 6712.1
- head_gender   1 6712.1
- independ      1 6712.1
- gender        1 6712.5
- anxiety       1 6713.7
<none>            6713.8
- selfcontrol   1 6714.5
- profession   14 6716.9
- way           2 6717.3
- age           1 6721.1
- greywage      1 6724.4
- traffic       7 6737.5
- industry     15 6744.5

Step:  AIC=6712.05
Surv(stag, event) ~ gender + age + industry + profession + traffic + 
    head_gender + greywage + way + independ + selfcontrol + anxiety

              Df    AIC
- head_gender  1 6710.4
- gender       1 6710.8
- independ     1 6711.0
<none>           6712.1
- anxiety      1 6713.4
- profession  14 6715.5
- way          2 6715.6
- selfcontrol  1 6717.3
- age          1 6719.1
- greywage     1 6722.5
- traffic      7 6736.4
- industry    15 6743.3

Step:  AIC=6710.35
Surv(stag, event) ~ gender + age + industry + profession + traffic + 
    greywage + way + independ + selfcontrol + anxiety

              Df    AIC
- gender       1 6709.0
- independ     1 6709.4
<none>           6710.4
- anxiety      1 6711.9
- way          2 6713.9
- profession  14 6714.1
- selfcontrol  1 6715.9
- age          1 6719.1
- greywage     1 6720.9
- traffic      7 6734.4
- industry    15 6741.9

Step:  AIC=6709
Surv(stag, event) ~ age + industry + profession + traffic + greywage + 
    way + independ + selfcontrol + anxiety

              Df    AIC
- independ     1 6708.2
<none>           6709.0
- anxiety      1 6711.7
- profession  14 6712.1
- way          2 6712.8
- selfcontrol  1 6715.3
- age          1 6717.6
- greywage     1 6720.0
- traffic      7 6733.2
- industry    15 6740.7

Step:  AIC=6708.17
Surv(stag, event) ~ age + industry + profession + traffic + greywage + 
    way + selfcontrol + anxiety

              Df    AIC
<none>           6708.2
- anxiety      1 6709.7
- profession  14 6710.9
- way          2 6712.0
- selfcontrol  1 6713.4
- age          1 6716.1
- greywage     1 6719.4
- traffic      7 6732.4
- industry    15 6739.6
Call:
coxph(formula = Surv(stag, event) ~ age + industry + profession + 
    traffic + greywage + way + selfcontrol + anxiety, data = turnover)

                                  coef exp(coef) se(coef)      z        p
age                            0.02055   1.02076  0.00641  3.205 0.001349
industryBanks                 -0.28975   0.74845  0.36289 -0.798 0.424614
industryBuilding              -0.26582   0.76658  0.38880 -0.684 0.494175
industryConsult               -0.45691   0.63324  0.36927 -1.237 0.215967
industryetc                   -0.64403   0.52517  0.36896 -1.746 0.080890
industryHoReCa                -0.77805   0.45930  0.54017 -1.440 0.149758
industryIT                    -1.24814   0.28704  0.38553 -3.237 0.001206
industrymanufacture           -0.87933   0.41506  0.36757 -2.392 0.016744
industryMining                -0.65940   0.51716  0.44284 -1.489 0.136485
industryPharma                -1.04233   0.35263  0.47232 -2.207 0.027324
industryPowerGeneration       -1.09181   0.33561  0.44269 -2.466 0.013651
industryRealEstate            -1.81850   0.16227  0.58090 -3.130 0.001745
industryRetail                -1.10739   0.33042  0.35511 -3.118 0.001818
industryState                 -0.73784   0.47814  0.40157 -1.837 0.066154
industryTelecom               -1.25500   0.28508  0.43786 -2.866 0.004154
industrytransport             -0.86358   0.42165  0.42212 -2.046 0.040775
professionBusinessDevelopment  0.59092   1.80564  0.50153  1.178 0.238702
professionCommercial           1.00727   2.73813  0.49840  2.021 0.043279
professionConsult              0.54819   1.73012  0.50936  1.076 0.281828
professionEngineer             0.94501   2.57284  0.52648  1.795 0.072662
professionetc                  0.45756   1.58021  0.48289  0.948 0.343360
professionFinance              0.05804   1.05976  0.51797  0.112 0.910784
professionHR                   0.22086   1.24714  0.42463  0.520 0.602987
professionIT                   0.02934   1.02978  0.47418  0.062 0.950654
professionLaw                  0.31703   1.37304  0.64071  0.495 0.620736
professionmanage               1.28255   3.60582  0.49774  2.577 0.009974
professionMarketing            0.70287   2.01955  0.47822  1.470 0.141624
professionPR                   0.82179   2.27458  0.63824  1.288 0.197889
professionSales                0.50708   1.66044  0.45661  1.111 0.266774
professionTeaching             0.62625   1.87058  0.56833  1.102 0.270502
trafficempjs                   0.85644   2.35476  0.30998  2.763 0.005729
trafficfriends                 0.03493   1.03555  0.33616  0.104 0.917244
trafficKA                      0.10265   1.10810  0.34993  0.293 0.769274
trafficrabrecNErab             0.48122   1.61805  0.30601  1.573 0.115817
trafficrecNErab               -0.12679   0.88092  0.37753 -0.336 0.736996
trafficreferal                 0.30802   1.36073  0.32125  0.959 0.337642
trafficyoujs                   0.60301   1.82761  0.30666  1.966 0.049257
greywagewhite                 -0.51396   0.59812  0.13401 -3.835 0.000125
waycar                        -0.21103   0.80975  0.10259 -2.057 0.039682
wayfoot                       -0.37776   0.68539  0.17333 -2.179 0.029301
selfcontrol                   -0.06082   0.94099  0.02269 -2.680 0.007353
anxiety                       -0.04924   0.95196  0.02619 -1.880 0.060130

Likelihood ratio test=169.3  on 42 df, p=< 2.2e-16
n= 1116, number of events= 560 

Using these results, we can see that the model including independent variables age, industry, profession, traffic, greywage, selfcontrol, and anxiety produced the most accurate survival model. We will use these variables in our adjusted model.

model1<-coxph(Surv(stag, event)~age + industry + profession + traffic + greywage + way + selfcontrol + anxiety,
                data = turnover)
summary(model1)
Call:
coxph(formula = Surv(stag, event) ~ age + industry + profession + 
    traffic + greywage + way + selfcontrol + anxiety, data = turnover)

  n= 1116, number of events= 560 

                                  coef exp(coef) se(coef)      z Pr(>|z|)    
age                            0.02055   1.02076  0.00641  3.205 0.001349 ** 
industryBanks                 -0.28975   0.74845  0.36289 -0.798 0.424614    
industryBuilding              -0.26582   0.76658  0.38880 -0.684 0.494175    
industryConsult               -0.45691   0.63324  0.36927 -1.237 0.215967    
industryetc                   -0.64403   0.52517  0.36896 -1.746 0.080890 .  
industryHoReCa                -0.77805   0.45930  0.54017 -1.440 0.149758    
industryIT                    -1.24814   0.28704  0.38553 -3.237 0.001206 ** 
industrymanufacture           -0.87933   0.41506  0.36757 -2.392 0.016744 *  
industryMining                -0.65940   0.51716  0.44284 -1.489 0.136485    
industryPharma                -1.04233   0.35263  0.47232 -2.207 0.027324 *  
industryPowerGeneration       -1.09181   0.33561  0.44269 -2.466 0.013651 *  
industryRealEstate            -1.81850   0.16227  0.58090 -3.130 0.001745 ** 
industryRetail                -1.10739   0.33042  0.35511 -3.118 0.001818 ** 
industryState                 -0.73784   0.47814  0.40157 -1.837 0.066154 .  
industryTelecom               -1.25500   0.28508  0.43786 -2.866 0.004154 ** 
industrytransport             -0.86358   0.42165  0.42212 -2.046 0.040775 *  
professionBusinessDevelopment  0.59092   1.80564  0.50153  1.178 0.238702    
professionCommercial           1.00727   2.73813  0.49840  2.021 0.043279 *  
professionConsult              0.54819   1.73012  0.50936  1.076 0.281828    
professionEngineer             0.94501   2.57284  0.52648  1.795 0.072662 .  
professionetc                  0.45756   1.58021  0.48289  0.948 0.343360    
professionFinance              0.05804   1.05976  0.51797  0.112 0.910784    
professionHR                   0.22086   1.24714  0.42463  0.520 0.602987    
professionIT                   0.02934   1.02978  0.47418  0.062 0.950654    
professionLaw                  0.31703   1.37304  0.64071  0.495 0.620736    
professionmanage               1.28255   3.60582  0.49774  2.577 0.009974 ** 
professionMarketing            0.70287   2.01955  0.47822  1.470 0.141624    
professionPR                   0.82179   2.27458  0.63824  1.288 0.197889    
professionSales                0.50708   1.66044  0.45661  1.111 0.266774    
professionTeaching             0.62625   1.87058  0.56833  1.102 0.270502    
trafficempjs                   0.85644   2.35476  0.30998  2.763 0.005729 ** 
trafficfriends                 0.03493   1.03555  0.33616  0.104 0.917244    
trafficKA                      0.10265   1.10810  0.34993  0.293 0.769274    
trafficrabrecNErab             0.48122   1.61805  0.30601  1.573 0.115817    
trafficrecNErab               -0.12679   0.88092  0.37753 -0.336 0.736996    
trafficreferal                 0.30802   1.36073  0.32125  0.959 0.337642    
trafficyoujs                   0.60301   1.82761  0.30666  1.966 0.049257 *  
greywagewhite                 -0.51396   0.59812  0.13401 -3.835 0.000125 ***
waycar                        -0.21103   0.80975  0.10259 -2.057 0.039682 *  
wayfoot                       -0.37776   0.68539  0.17333 -2.179 0.029301 *  
selfcontrol                   -0.06082   0.94099  0.02269 -2.680 0.007353 ** 
anxiety                       -0.04924   0.95196  0.02619 -1.880 0.060130 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

                              exp(coef) exp(-coef) lower .95 upper .95
age                              1.0208     0.9797   1.00801    1.0337
industryBanks                    0.7485     1.3361   0.36751    1.5243
industryBuilding                 0.7666     1.3045   0.35777    1.6425
industryConsult                  0.6332     1.5792   0.30707    1.3058
industryetc                      0.5252     1.9041   0.25483    1.0823
industryHoReCa                   0.4593     2.1772   0.15933    1.3240
industryIT                       0.2870     3.4839   0.13483    0.6111
industrymanufacture              0.4151     2.4093   0.20195    0.8531
industryMining                   0.5172     1.9336   0.21711    1.2319
industryPharma                   0.3526     2.8358   0.13973    0.8899
industryPowerGeneration          0.3356     2.9797   0.14093    0.7992
industryRealEstate               0.1623     6.1626   0.05197    0.5066
industryRetail                   0.3304     3.0264   0.16474    0.6627
industryState                    0.4781     2.0914   0.21764    1.0505
industryTelecom                  0.2851     3.5078   0.12085    0.6725
industrytransport                0.4217     2.3716   0.18435    0.9644
professionBusinessDevelopment    1.8056     0.5538   0.67567    4.8254
professionCommercial             2.7381     0.3652   1.03089    7.2727
professionConsult                1.7301     0.5780   0.63753    4.6951
professionEngineer               2.5728     0.3887   0.91679    7.2203
professionetc                    1.5802     0.6328   0.61331    4.0715
professionFinance                1.0598     0.9436   0.38398    2.9249
professionHR                     1.2471     0.8018   0.54258    2.8666
professionIT                     1.0298     0.9711   0.40656    2.6084
professionLaw                    1.3730     0.7283   0.39112    4.8201
professionmanage                 3.6058     0.2773   1.35933    9.5650
professionMarketing              2.0195     0.4952   0.79103    5.1560
professionPR                     2.2746     0.4396   0.65107    7.9465
professionSales                  1.6604     0.6023   0.67850    4.0634
professionTeaching               1.8706     0.5346   0.61406    5.6983
trafficempjs                     2.3548     0.4247   1.28261    4.3232
trafficfriends                   1.0355     0.9657   0.53583    2.0013
trafficKA                        1.1081     0.9024   0.55810    2.2001
trafficrabrecNErab               1.6181     0.6180   0.88821    2.9476
trafficrecNErab                  0.8809     1.1352   0.42032    1.8463
trafficreferal                   1.3607     0.7349   0.72498    2.5540
trafficyoujs                     1.8276     0.5472   1.00196    3.3336
greywagewhite                    0.5981     1.6719   0.45996    0.7778
waycar                           0.8098     1.2349   0.66226    0.9901
wayfoot                          0.6854     1.4590   0.48798    0.9627
selfcontrol                      0.9410     1.0627   0.90005    0.9838
anxiety                          0.9520     1.0505   0.90432    1.0021

Concordance= 0.66  (se = 0.012 )
Likelihood ratio test= 169.3  on 42 df,   p=<2e-16
Wald test            = 172.2  on 42 df,   p=<2e-16
Score (logrank) test = 178.8  on 42 df,   p=<2e-16

Hazard ratios above 1 indicate a higher likelihood of the individuals in that category to experience the event (termination) in the given timeframe.

hr=exp(model1$coefficients)
hr
                          age                 industryBanks 
                    1.0207593                     0.7484527 
             industryBuilding               industryConsult 
                    0.7665782                     0.6332371 
                  industryetc                industryHoReCa 
                    0.5251713                     0.4593025 
                   industryIT           industrymanufacture 
                    0.2870376                     0.4150602 
               industryMining                industryPharma 
                    0.5171611                     0.3526310 
      industryPowerGeneration            industryRealEstate 
                    0.3356074                     0.1622689 
               industryRetail                 industryState 
                    0.3304219                     0.4781437 
              industryTelecom             industrytransport 
                    0.2850769                     0.4216508 
professionBusinessDevelopment          professionCommercial 
                    1.8056453                     2.7381273 
            professionConsult            professionEngineer 
                    1.7301166                     2.5728377 
                professionetc             professionFinance 
                    1.5802152                     1.0597560 
                 professionHR                  professionIT 
                    1.2471440                     1.0297797 
                professionLaw              professionmanage 
                    1.3730396                     3.6058251 
          professionMarketing                  professionPR 
                    2.0195459                     2.2745773 
              professionSales            professionTeaching 
                    1.6604383                     1.8705848 
                 trafficempjs                trafficfriends 
                    2.3547649                     1.0355461 
                    trafficKA            trafficrabrecNErab 
                    1.1080976                     1.6180539 
              trafficrecNErab                trafficreferal 
                    0.8809202                     1.3607294 
                 trafficyoujs                 greywagewhite 
                    1.8276083                     0.5981229 
                       waycar                       wayfoot 
                    0.8097502                     0.6853927 
                  selfcontrol                       anxiety 
                    0.9409886                     0.9519574 

Model Comparison

Crude Model Adjusted Model
Covariate Hazard ratio (95% CI), p-value Hazard ratio (95% CI), p-value
genderm (0.6977,1.1489) 0.384855 Excluded
age (1.0076,1.0356) 0.002292 (1.00801,1.0337) 0.001349
industryBanks (0.385,1.6444) 0.537112 (0.36751,1.5243) 0.424614
industryBuilding (0.3668,1.725) 0.562272 (0.35777,1.6425) 0.494175
industryConsult (0.3234, 1.4235) 0.304934 (0.30707,1.3058) 0.215967
industryetc (0.2701,1.1797) 0.128365 (0.25483,1.0823) 0.08089
industryHoReCa (0.1807,1.5343) 0.239782 (0.15933,1.324) 0.149758
industryIT (0.141,0.658) 0.002482 (0.13483,0.6111) 0.001206
industrymanufacture (0.216,0.9352) 0.032415 (0.20195,0.8531) 0.016744
industryMining (0.2267,1.3199) 0.179465 (0.21711,1.2319) 0.136485
industryPharma (0.1426,0.9391) 0.036562 (0.13973,0.8899) 0.027324
industryPowerGeneration (0.1469,0.8631) 0.022261 (0.14093,0.7992) 0.013651
industryRealEstate (0.0562,0.5642) 0.003362 (0.05197,0.5066) 0.001745
industryRetail (0.1729,0.7194) 0.004165 (0.16474,0.6627) 0.001818
industryState (0.2297,1.1466) 0.103875 (0.21764,1.0505) 0.066154
industryTelecom (0.1267,0.7348) 0.008119 (0.12085,0.6725) 0.004154
industrytransport (0.1843,0.9868) 0.046482 (0.18435,0.9644) 0.040775
professionBusinessDevelopment (0.6728,4.9436) 0.237597 (0.67567,4.8254) 0.238702
professionCommercial (1.0043,7.3408) 0.049022 (1.03089,7.2727) 0.043279
professionConsult (0.638,4.9074) 0.272921 (0.63753,4.6951) 0.281828
professionEngineer (0.9446,7.8047) 0.063726 (0.91679,7.2203) 0.072662
professionetc (0.6267,4.2195) 0.3176 (0.61331,4.0715) 0.34336
professionFinance (0.3765,2.9634) 0.91722 (0.38398,2.9249) 0.910784
professionHR (0.5274,2.841) 0.637865 (0.54258,2.8666) 0.602987
professionIT (0.409,2.8105) 0.887228 (0.40656,2.6084) 0.950654
professionLaw (0.4205,5.3292) 0.53352 (0.39112,4.8201) 0.620736
professionmanage (1.3551,9.6256) 0.010245 (1.35933,9.565) 0.009974
professionMarketing (0.8028,5.3221) 0.132411 (0.79103,5.156) 0.141624
professionPR (0.6648,8.173) 0.186097 (0.65107,7.9465) 0.197889
professionSales (0.6627,4.1455) 0.280002 (0.6785,4.0634) 0.266774
professionTeaching (0.6072,5.659) 0.278446 (0.61406,5.6983) 0.270502
trafficempjs (1.3655,4.6877) 0.003179 (1.28261,4.3232) 0.005729
trafficfriends (0.5777,2.21) 0.721151 (0.53583,2.0013) 0.917244
trafficKA (0.5762,2.3039) 0.688659 (0.5581,2.2001) 0.769274
trafficrabrecNErab (0.9421,3.1768) 0.077124 (0.88821,2.9476) 0.115817
trafficrecNErab (0.4498,2.0064) 0.893095 (0.42032,1.8463) 0.736996
trafficreferal (0.7639,2.7337) 0.257711 (0.72498,2.554) 0.337642
trafficyoujs (1.0493,3.5293) 0.034395 (1.00196,3.3336) 0.049257
coachno (0.85,1.316) 0.615221 Excluded
coachyes (0.9172,1.6628) 0.164329 Excluded
head_genderm (0.8642,1.2921) 0.590846 Excluded
greywagewhite (0.4625,0.7863) 0.000187 (0.45996,0.7778) 0.000125
waycar (0.6669,1.0019) 0.052153 (0.66226,0.9901) 0.039682
wayfoot (0.4751,0.9404) 0.020726 (0.48798,0.9627) 0.029301
extraversion (0.9485,1.0899) 0.639236 Excluded
independ (0.9144,1.052) 0.587962 Excluded
selfcontrol (0.8904,1.0249) 0.202321 (0.90005,0.9838) 0.007353
anxiety (0.8899,1.0193) 0.158827 (0.90432,1.0021) 0.06013
novator (0.9503,1.0716) 0.766196 Excluded

Predictions

We will now use our adjusted model to predict 3-year survival rates between two employees, one age 25 and one age 55. All other variables will remain the same.

new_hire <- data.frame(age = c(25,55), industry = c("IT","IT"), stag= c(36, 36), event= c(0,0), profession = c("manage","manage"), traffic = c("empjs","empjs"), greywage = c("white","white"), way = c("car","car"), independ = c(5,5), selfcontrol = c(5,5), anxiety = c(5,5))

preds <- predict(model1, newdata = new_hire, type = "survival", se.fit = TRUE)

new_hire$prob <- preds$fit
new_hire$lcl <- preds$fit - 1.96*preds$se.fit
new_hire$ucl <- preds$fit + 1.96*preds$se.fit
new_hire

Our results show a survival probability of 38% for the 25-year old, and 17% percent for the 55-year old.

Conclusions

  • Cox regression can be used for any time-to-event data, not necessarily meaning death or disease.
  • Eight variables were said to have a higher influence on survival rate: age, industry, profession, traffic, greywage, way, selfcontrol, anxiety.
  • Study objective is to predict employee attrition to mitigate loss to the company.
  • Companies can provide benefits or stipends (such as bus passes, in this case) to try and retain employees. Specific industries can be focused on to improve retention.

Analysis: Modeling of Survival after Chemotherapy for Colon Cancer

By Hermela Shimelis

  • Data: Survival after chemotherapy for Stage B/C colon cancer [6]

  • Goal: Model the relationship between survival time and treatment groups

  • Predictors:

Category Variables
Treatments

- Observation (no treatment)

- Amisole (Lev)

- Amisole + 5-FU

Patient Characteristics

- Age

- Sex

Tumor Characteristics

- Colon perforation and obstruction

- Adherence to nearby organs

- Tumor differentiation

- Local spread

Exploratory Data Analysis

Kaplan-Meier Curve Stratified by Treatment Groups

# Estimate the median survival time among the three groups
survfit(Surv(time,status) ~ rx, data = colon_surv)
Call: survfit(formula = Surv(time, status) ~ rx, data = colon_surv)

             n events median 0.95LCL 0.95UCL
rx=Obs     315    168   2083    1656    2789
rx=Lev     310    161   2152    1540      NA
rx=Lev+5FU 304    123     NA    2725      NA
# count the number of events after 2080 days, which is the median survival time among the observation group
tt <- colon_surv%>%filter(time > 2083)%>% group_by(rx)%>%summarise(ct = n(),
                                                                   death = sum(status))
# Plot survival curve
fit <- survfit(Surv(time,status) ~ rx, data = colon_surv)
ggsurvplot(fit, data=colon_surv, risk.table = TRUE)

# Estimate the probability of surviving beyond 3000 days
summary(survfit(Surv(time, status) ~ rx, data = colon_surv), times = 3000)
Call: survfit(formula = Surv(time, status) ~ rx, data = colon_surv)

                rx=Obs 
        time       n.risk      n.event     survival      std.err lower 95% CI 
    3.00e+03     6.00e+00     1.68e+02     4.08e-01     3.97e-02     3.37e-01 
upper 95% CI 
    4.94e-01 

                rx=Lev 
        time       n.risk      n.event     survival      std.err lower 95% CI 
    3.00e+03     4.00e+00     1.61e+02     3.92e-01     5.63e-02     2.96e-01 
upper 95% CI 
    5.20e-01 

                rx=Lev+5FU 
        time       n.risk      n.event     survival      std.err lower 95% CI 
    3.00e+03     7.00e+00     1.23e+02     5.61e-01     3.43e-02     4.97e-01 
upper 95% CI 
    6.32e-01 
# compare significant difference in survival times between the three groups
survdiff(Surv(time, status)~ rx, data = colon_surv)
Call:
survdiff(formula = Surv(time, status) ~ rx, data = colon_surv)

             N Observed Expected (O-E)^2/E (O-E)^2/V
rx=Obs     315      168      148      2.58      3.85
rx=Lev     310      161      146      1.52      2.25
rx=Lev+5FU 304      123      157      7.55     11.62

 Chisq= 11.7  on 2 degrees of freedom, p= 0.003 
# Plot survival curve
#| echo: FALSE
#| message: false
#| warning: false
#| include: TRUE
fit <- survfit(Surv(time,status) ~ rx, data = colon_surv)
ggsurvplot(fit, data=colon_surv, risk.table = TRUE)

Cox Regression Models

library(broom)
#| echo: False
#| message: false
#| warning: false
#| include: false
m0 <- coxph(Surv(time, status) ~ 1, data = df)
summary_m0 = summary(m0)
c_index_m0 <- concordance(m0)
#cat("Concordance of the base model:",c_index_m0$concordance)
Characteristic HR1 95% CI1 p-value
rx


    Obs
    Lev 0.97 0.78, 1.21 0.8
    Lev+5FU 0.69 0.55, 0.87 0.002
Concordance = 0.536
1 HR = Hazard Ratio, CI = Confidence Interval
Characteristic HR1 95% CI1 p-value
rx


    Obs
    Lev 0.98 0.79, 1.22 0.9
    Lev+5FU 0.69 0.54, 0.87 0.002
age 1.01 1.00, 1.02 0.083
sex 1.04 0.86, 1.26 0.7
perfor 1.00 0.59, 1.70 >0.9
adhere 1.18 0.92, 1.53 0.2
surg 1.27 1.03, 1.55 0.022
obstruct 1.33 1.06, 1.68 0.015
differentiation


    moderate
    poor 1.43 1.13, 1.82 0.003
    well 1.08 0.78, 1.50 0.6
node4 2.55 2.10, 3.09 <0.001
local_spread


    contiguous
    muscle 0.39 0.23, 0.64 <0.001
    serosa 0.64 0.43, 0.94 0.023
    submucosa 0.29 0.10, 0.83 0.021
Concordance = 0.674
1 HR = Hazard Ratio, CI = Confidence Interval
Characteristic HR1 95% CI1 p-value
rx


    Obs
    Lev 0.99 0.80, 1.23 >0.9
    Lev+5FU 0.69 0.54, 0.87 0.002
age 1.01 1.00, 1.02 0.069
surg 1.28 1.04, 1.56 0.018
obstruct 1.33 1.06, 1.67 0.015
differentiation


    moderate
    poor 1.45 1.15, 1.84 0.002
    well 1.07 0.77, 1.48 0.7
node4 2.53 2.09, 3.07 <0.001
local_spread


    contiguous
    muscle 0.37 0.23, 0.61 <0.001
    serosa 0.61 0.41, 0.89 0.010
    submucosa 0.27 0.09, 0.76 0.014
Concordance = 0.672
1 HR = Hazard Ratio, CI = Confidence Interval
Schoenfeld Residuals Test Results
chisq df p Variable
rx 2.335 2 0.311 rx
age 0.549 1 0.459 age
surg 0.020 1 0.888 surg
obstruct 6.148 1 0.013 obstruct
differentiation 17.459 2 0.000 differentiation
node4 5.662 1 0.017 node4
local_spread 7.083 3 0.069 local_spread
GLOBAL 37.525 11 0.000 GLOBAL
Characteristic HR1 95% CI1 p-value
rx


    Obs
    Lev 0.98 0.79, 1.22 0.9
    Lev+5FU 0.71 0.56, 0.89 0.003
age 1.01 1.00, 1.02 0.034
surg 1.30 1.06, 1.59 0.012
node4 2.50 2.06, 3.04 <0.001
local_spread


    contiguous
    muscle 0.34 0.21, 0.56 <0.001
    serosa 0.58 0.39, 0.84 0.004
    submucosa 0.24 0.08, 0.67 0.007
Concordance = 0.674
1 HR = Hazard Ratio, CI = Confidence Interval

Stratified Model Meets Proportional Hazards Assumption

Schoenfeld Residuals Test Results
chisq df p Variable
rx 2.001 2 0.368 rx
age 0.670 1 0.413 age
surg 0.014 1 0.905 surg
node4 4.288 1 0.038 node4
local_spread 5.298 3 0.151 local_spread
GLOBAL 12.411 8 0.134 GLOBAL

Model Evaluation Metrics

Model Description AIC BIC C_Index
Model 1 Base model 5860.383 5860.383 0.500
Model 2 Treatment 5852.236 5860.463 0.536
Model 3 Full variables 5741.401 5798.993 0.674
Model 4 Stepwise-selected variables 5737.261 5782.511 0.672
Model 5 Stratified 4567.829 4600.739 0.674

K-fold Cross Validation

#| echo: False
#| message: false
#| warning: false
#| include: False

library(survival)
library(caret)

# Fit the Cox model
cox_model <- coxph(Surv(time, status) ~ rx + age + surg + strata(obstruct) + strata(differentiation) + node4 + local_spread, data = df)

# Calculate the original c-index
c_index_original <- survConcordance(Surv(time, status) ~ predict(cox_model), data = df)
#cat("Original c-index:", c_index_original$concordance, "\n")

# Create a function for calculating c-index in each fold using survConcordance
cox_cindex <- function(train_data, test_data) {
  fit <- coxph(Surv(time, status) ~ rx + age + surg + strata(obstruct) + strata(differentiation) + node4 + local_spread, data = train_data)
  # Calculate concordance on test data
  c_index <- survConcordance(Surv(test_data$time, test_data$status) ~ predict(fit, newdata = test_data), data = test_data)$concordance
  
  return(c_index)
}

# Perform 5-fold cross-validation with stratification
K <- 5
folds <- createFolds(df$status, k = K, list = TRUE, returnTrain = TRUE)
cv_c_indices <- sapply(folds, function(train_indices) {
  train_data <- df[train_indices, ]
  test_data <- df[-train_indices, ]
  cox_cindex(train_data, test_data) # use the concordance function inside cox_cindex
})

# Print cross-validated c-indices
Original c-index: 0.6544784 
Mean cross-validated c-Index: 0.6420104 

Conclusions

References

[1]
S. J. Walters, “Analyzing time to event outcomes with a cox regression model,” Wiley Interdiscip. Rev. Comput. Stat., vol. 4, no. 3, pp. 310–315, May 2012.
[2]
C. A. Bellera, G. MacGrogan, M. Debled, C. T. de Lara, V. Brouste, and S. Mathoulin-Pélissier, “Variables with time-varying effects and the cox model: Some statistical concepts illustrated with a prognostic factor study in breast cancer,” BMC Med. Res. Methodol., vol. 10, no. 1, p. 20, Mar. 2010.
[3]
S. Abd ElHafeez, G. D’Arrigo, D. Leonardis, M. Fusaro, G. Tripepi, and S. Roumeliotis, “Methods to analyze time-to-event data: The cox regression analysis,” Oxid. Med. Cell. Longev., vol. 2021, no. 1, p. 1302811, Nov. 2021.
[4]
Z. Zhang, J. Reinikainen, K. A. Adeleke, M. E. Pieterse, and C. G. M. Groothuis-Oudshoorn, “Time-varying covariates and coefficients in cox regression models,” Ann. Transl. Med., vol. 6, no. 7, pp. 121–121, Apr. 2018.
[5]
Terry M. Therneau and Patricia M. Grambsch, Modeling survival data: Extending the Cox model. New York: Springer, 2000.
[6]
T. M. Therneau, A package for survival analysis in r. 2024. Available: https://CRAN.R-project.org/package=survival